Data Partitioning and Feature Engineering

Author
Affiliation

Nakul R. Padalkar

Boston University

Published

September 29, 2023

Modified

November 4, 2025

1 Importing Data and Packages

The dplyr package is used for data manipulation and the ggplot2 package for data visualization. Both are among the most frequently used packages in machine-learning workflows and have extensive documentation and examples on their websites.

# Helper packages
library(dplyr)     # for data manipulation
library(ggplot2)   # for awesome graphics
library(gt)

1.1 Modeling Process packages

library(rsample)  # for resampling procedures
library(caret)    # for resampling and model training
library(h2o)      # for resampling and model training

1.1.1 The h2o library

H2O is a fully open-source, distributed in-memory machine-learning platform with linear scalability. H2O supports the most widely used statistical & machine learning algorithms including gradient-boosted machines, generalized linear models, deep learning and more. H2O also has an industry-leading AutoML functionality that automatically runs through all the algorithms and their hyperparameters to produce a leaderboard of the best models. The H2O platform is used by over 18,000 organizations globally and is extremely popular in both the R & Python communities.

# h2o set up

h2o.no_progress() # turn off progress bars
h2o.init()        # launch h2o

H2O is not running yet, starting it now...

Note:  In case of errors look at the following log files:
    /tmp/RtmpAUmhtW/file3e9c382a9e8/h2o_runner_started_from_r.out
    /tmp/RtmpAUmhtW/file3e9c73cda24/h2o_runner_started_from_r.err


Starting H2O JVM and connecting: .... Connection successful!

R is connected to the H2O cluster: 
    H2O cluster uptime:         1 seconds 642 milliseconds 
    H2O cluster timezone:       UTC 
    H2O data parsing timezone:  UTC 
    H2O cluster version:        3.44.0.3 
    H2O cluster version age:    1 year, 10 months and 15 days 
    H2O cluster name:           H2O_started_from_R_runner_mtg758 
    H2O cluster total nodes:    1 
    H2O cluster total memory:   3.91 GB 
    H2O cluster total cores:    4 
    H2O cluster allowed cores:  4 
    H2O cluster healthy:        TRUE 
    H2O Connection ip:          localhost 
    H2O Connection port:        54321 
    H2O Connection proxy:       NA 
    H2O Internal Security:      FALSE 
    R Version:                  R version 4.5.0 (2025-04-11) 
Warning in h2o.clusterInfo(): 
Your H2O cluster version is (1 year, 10 months and 15 days) old. There may be a newer version available.
Please download and install the latest version from: https://h2o-release.s3.amazonaws.com/h2o/latest_stable.html

1.2 Loading and Importing Data

We first need to load the dataset from a CSV file. In a real-world scenario, this would be a database connection.

# Read your dataset
ames <- read.csv("../data/AmesHousing2.csv")
# ames <- AmesHousing::make_ames()

We need to convert the original data frame into an H2O object before running H2O models.

ames.h2o0 <- as.h2o(ames)

head(ames$SalePrice)  # response variable
[1] 171000 139000 159000 136000 230000 190000
target <- 'SalePrice'
predictors <- setdiff(colnames(ames), target)

Let’s see what the data looks like by looking at the first few rows of the dataset.

library(kableExtra)

Attaching package: 'kableExtra'
The following object is masked from 'package:dplyr':

    group_rows
kable(head(ames), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
Order Lot.Frontage Lot.Area Lot.Shape Utilities Lot.Config Land.Slope Neighborhood Bldg.Type House.Style Overall.Qual Overall.Cond Year.Built Year.Remod.Add Foundation Bsmt.Unf.SF Total.Bsmt.SF BaseLivArea Central.Air X1st.Flr.SF X2nd.Flr.SF Gr.Liv.Area Full.Bath Half.Bath Bathrooms Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Fireplaces Garage.Type Garage.Area Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch Screen.Porch Mo.Sold Yr.Sold Sale.Condition SalePrice
117 80 9600 Reg AllPub Inside Gtl NWAmes 1Fam 2Story 6 6 1971 1971 CBlock 386 715 329 Y 930 715 1645 1 2 2 4 1 TA 7 0 Attchd 441 0 78 0 0 0 6 2010 Normal 171000
325 94 9400 Reg AllPub Corner Gtl Mitchel Duplex 2Story 6 5 1971 1971 CBlock 912 912 0 Y 912 912 1824 2 2 3 4 2 TA 8 0 NA 0 128 0 0 0 0 4 2010 Normal 139000
337 70 7700 Reg AllPub Inside Gtl Mitchel Duplex 2Story 5 2 1985 1986 PConc 1216 1216 0 Y 1216 1216 2432 4 2 5 4 2 TA 10 0 Attchd 616 200 0 0 0 0 2 2010 Normal 159000
393 60 9000 Reg AllPub FR2 Gtl NAmes Duplex 2Story 5 5 1974 1974 CBlock 896 896 0 Y 896 896 1792 2 2 3 4 2 TA 8 0 NA 0 32 45 0 0 0 9 2009 Normal 136000
590 NA 13774 IR1 AllPub Inside Gtl NWAmes 1Fam 2Story 7 7 1977 1992 PConc 476 908 432 Y 1316 972 2288 1 2 2 4 1 Gd 8 2 Attchd 520 321 72 0 0 156 11 2009 Normal 230000
667 81 9671 Reg AllPub Corner Gtl NAmes Duplex 2Story 6 5 1969 1969 CBlock 1248 1248 0 Y 1248 1296 2544 2 2 3 6 2 TA 12 0 Attchd 907 0 0 0 0 0 8 2009 Normal 190000
# head(ames)

Now let's look at a summary of the dataset.

library(gtsummary)
summary(ames)
     Order         Lot.Frontage       Lot.Area       Lot.Shape        
 Min.   :   1.0   Min.   : 21.00   Min.   :  1300   Length:2930       
 1st Qu.: 733.2   1st Qu.: 58.00   1st Qu.:  7440   Class :character  
 Median :1465.5   Median : 68.00   Median :  9436   Mode  :character  
 Mean   :1465.5   Mean   : 69.22   Mean   : 10148                     
 3rd Qu.:2197.8   3rd Qu.: 80.00   3rd Qu.: 11555                     
 Max.   :2930.0   Max.   :313.00   Max.   :215245                     
                  NA's   :490                                         
  Utilities          Lot.Config         Land.Slope        Neighborhood      
 Length:2930        Length:2930        Length:2930        Length:2930       
 Class :character   Class :character   Class :character   Class :character  
 Mode  :character   Mode  :character   Mode  :character   Mode  :character  
                                                                            
                                                                            
                                                                            
                                                                            
  Bldg.Type         House.Style         Overall.Qual     Overall.Cond  
 Length:2930        Length:2930        Min.   : 1.000   Min.   :1.000  
 Class :character   Class :character   1st Qu.: 5.000   1st Qu.:5.000  
 Mode  :character   Mode  :character   Median : 6.000   Median :5.000  
                                       Mean   : 6.095   Mean   :5.563  
                                       3rd Qu.: 7.000   3rd Qu.:6.000  
                                       Max.   :10.000   Max.   :9.000  
                                                                       
   Year.Built   Year.Remod.Add  Foundation         Bsmt.Unf.SF    
 Min.   :1872   Min.   :1950   Length:2930        Min.   :   0.0  
 1st Qu.:1954   1st Qu.:1965   Class :character   1st Qu.: 219.0  
 Median :1973   Median :1993   Mode  :character   Median : 466.0  
 Mean   :1971   Mean   :1984                      Mean   : 559.3  
 3rd Qu.:2001   3rd Qu.:2004                      3rd Qu.: 802.0  
 Max.   :2010   Max.   :2010                      Max.   :2336.0  
                                                  NA's   :1       
 Total.Bsmt.SF   BaseLivArea     Central.Air         X1st.Flr.SF    
 Min.   :   0   Min.   :   0.0   Length:2930        Min.   : 334.0  
 1st Qu.: 793   1st Qu.:   0.0   Class :character   1st Qu.: 876.2  
 Median : 990   Median : 459.5   Mode  :character   Median :1084.0  
 Mean   :1052   Mean   : 492.2                      Mean   :1159.6  
 3rd Qu.:1302   3rd Qu.: 808.0                      3rd Qu.:1384.0  
 Max.   :6110   Max.   :5644.0                      Max.   :5095.0  
 NA's   :1                                                          
  X2nd.Flr.SF      Gr.Liv.Area     Full.Bath       Half.Bath     
 Min.   :   0.0   Min.   : 334   Min.   :0.000   Min.   :0.0000  
 1st Qu.:   0.0   1st Qu.:1126   1st Qu.:1.000   1st Qu.:0.0000  
 Median :   0.0   Median :1442   Median :2.000   Median :0.0000  
 Mean   : 335.5   Mean   :1500   Mean   :1.567   Mean   :0.3795  
 3rd Qu.: 703.8   3rd Qu.:1743   3rd Qu.:2.000   3rd Qu.:1.0000  
 Max.   :2065.0   Max.   :5642   Max.   :4.000   Max.   :2.0000  
                                                                 
   Bathrooms     Bedroom.AbvGr   Kitchen.AbvGr   Kitchen.Qual      
 Min.   :0.000   Min.   :0.000   Min.   :0.000   Length:2930       
 1st Qu.:1.000   1st Qu.:2.000   1st Qu.:1.000   Class :character  
 Median :2.000   Median :3.000   Median :1.000   Mode  :character  
 Mean   :1.756   Mean   :2.854   Mean   :1.044                     
 3rd Qu.:2.500   3rd Qu.:3.000   3rd Qu.:1.000                     
 Max.   :5.000   Max.   :8.000   Max.   :3.000                     
                                                                   
 TotRms.AbvGrd      Fireplaces     Garage.Type         Garage.Area    
 Min.   : 2.000   Min.   :0.0000   Length:2930        Min.   :   0.0  
 1st Qu.: 5.000   1st Qu.:0.0000   Class :character   1st Qu.: 320.0  
 Median : 6.000   Median :1.0000   Mode  :character   Median : 480.0  
 Mean   : 6.443   Mean   :0.5993                      Mean   : 472.8  
 3rd Qu.: 7.000   3rd Qu.:1.0000                      3rd Qu.: 576.0  
 Max.   :15.000   Max.   :4.0000                      Max.   :1488.0  
                                                      NA's   :1       
  Wood.Deck.SF     Open.Porch.SF    Enclosed.Porch     X3Ssn.Porch     
 Min.   :   0.00   Min.   :  0.00   Min.   :   0.00   Min.   :  0.000  
 1st Qu.:   0.00   1st Qu.:  0.00   1st Qu.:   0.00   1st Qu.:  0.000  
 Median :   0.00   Median : 27.00   Median :   0.00   Median :  0.000  
 Mean   :  93.75   Mean   : 47.53   Mean   :  23.01   Mean   :  2.592  
 3rd Qu.: 168.00   3rd Qu.: 70.00   3rd Qu.:   0.00   3rd Qu.:  0.000  
 Max.   :1424.00   Max.   :742.00   Max.   :1012.00   Max.   :508.000  
                                                                       
  Screen.Porch    Mo.Sold          Yr.Sold     Sale.Condition    
 Min.   :  0   Min.   : 1.000   Min.   :2006   Length:2930       
 1st Qu.:  0   1st Qu.: 4.000   1st Qu.:2007   Class :character  
 Median :  0   Median : 6.000   Median :2008   Mode  :character  
 Mean   : 16   Mean   : 6.216   Mean   :2008                     
 3rd Qu.:  0   3rd Qu.: 8.000   3rd Qu.:2009                     
 Max.   :576   Max.   :12.000   Max.   :2010                     
                                                                 
   SalePrice     
 Min.   : 12789  
 1st Qu.:129500  
 Median :160000  
 Mean   :180796  
 3rd Qu.:213500  
 Max.   :755000  
                 

Let’s produce a nicer summary table with gtsummary.

 ames %>% 
 gtsummary::tbl_summary(
  ) %>% 
  gtsummary::bold_labels() %>% 
  gtsummary::as_kable_extra(
    format = "html",
    longtable = TRUE,
    linesep = ""
    ) %>%
  kableExtra::kable_styling(
      position = "left",
      latex_options = c("striped", "repeat_header"),
      stripe_color = "gray!15"
    )
Characteristic N = 2,930
Order 1,466 (733, 2,198)
Lot.Frontage 68 (58, 80)
Unknown 490
Lot.Area 9,437 (7,440, 11,556)
Lot.Shape
IR1 979 (33%)
IR2 76 (2.6%)
IR3 16 (0.5%)
Reg 1,859 (63%)
Utilities
AllPub 2,927 (100%)
NoSeWa 1 (<0.1%)
NoSewr 2 (<0.1%)
Lot.Config
Corner 511 (17%)
CulDSac 180 (6.1%)
FR2 85 (2.9%)
FR3 14 (0.5%)
Inside 2,140 (73%)
Land.Slope
Gtl 2,789 (95%)
Mod 125 (4.3%)
Sev 16 (0.5%)
Neighborhood
Blmngtn 28 (1.0%)
Blueste 10 (0.3%)
BrDale 30 (1.0%)
BrkSide 108 (3.7%)
ClearCr 44 (1.5%)
CollgCr 267 (9.1%)
Crawfor 103 (3.5%)
Edwards 194 (6.6%)
Gilbert 165 (5.6%)
Greens 8 (0.3%)
GrnHill 2 (<0.1%)
IDOTRR 93 (3.2%)
Landmrk 1 (<0.1%)
MeadowV 37 (1.3%)
Mitchel 114 (3.9%)
NAmes 443 (15%)
NoRidge 71 (2.4%)
NPkVill 23 (0.8%)
NridgHt 166 (5.7%)
NWAmes 131 (4.5%)
OldTown 239 (8.2%)
Sawyer 151 (5.2%)
SawyerW 125 (4.3%)
Somerst 182 (6.2%)
StoneBr 51 (1.7%)
SWISU 48 (1.6%)
Timber 72 (2.5%)
Veenker 24 (0.8%)
Bldg.Type
1Fam 2,425 (83%)
2fmCon 62 (2.1%)
Duplex 109 (3.7%)
Twnhs 101 (3.4%)
TwnhsE 233 (8.0%)
House.Style
1.5Fin 314 (11%)
1.5Unf 19 (0.6%)
1Story 1,481 (51%)
2.5Fin 8 (0.3%)
2.5Unf 24 (0.8%)
2Story 873 (30%)
SFoyer 83 (2.8%)
SLvl 128 (4.4%)
Overall.Qual 6 (5, 7)
Overall.Cond
1 7 (0.2%)
2 10 (0.3%)
3 50 (1.7%)
4 101 (3.4%)
5 1,654 (56%)
6 533 (18%)
7 390 (13%)
8 144 (4.9%)
9 41 (1.4%)
Year.Built 1,973 (1,954, 2,001)
Year.Remod.Add 1,993 (1,965, 2,004)
Foundation
BrkTil 311 (11%)
CBlock 1,244 (42%)
PConc 1,310 (45%)
Slab 49 (1.7%)
Stone 11 (0.4%)
Wood 5 (0.2%)
Bsmt.Unf.SF 466 (219, 802)
Unknown 1
Total.Bsmt.SF 990 (793, 1,302)
Unknown 1
BaseLivArea 460 (0, 808)
Central.Air
N 196 (6.7%)
Y 2,734 (93%)
X1st.Flr.SF 1,084 (876, 1,384)
X2nd.Flr.SF 0 (0, 704)
Gr.Liv.Area 1,442 (1,126, 1,743)
Full.Bath
0 12 (0.4%)
1 1,318 (45%)
2 1,532 (52%)
3 64 (2.2%)
4 4 (0.1%)
Half.Bath
0 1,843 (63%)
1 1,062 (36%)
2 25 (0.9%)
Bathrooms 2.00 (1.00, 2.50)
Bedroom.AbvGr
0 8 (0.3%)
1 112 (3.8%)
2 743 (25%)
3 1,597 (55%)
4 400 (14%)
5 48 (1.6%)
6 21 (0.7%)
8 1 (<0.1%)
Kitchen.AbvGr
0 3 (0.1%)
1 2,796 (95%)
2 129 (4.4%)
3 2 (<0.1%)
Kitchen.Qual
Ex 205 (7.0%)
Fa 70 (2.4%)
Gd 1,160 (40%)
Po 1 (<0.1%)
TA 1,494 (51%)
TotRms.AbvGrd 6 (5, 7)
Fireplaces
0 1,422 (49%)
1 1,274 (43%)
2 221 (7.5%)
3 12 (0.4%)
4 1 (<0.1%)
Garage.Type
2Types 23 (0.8%)
Attchd 1,731 (62%)
Basment 36 (1.3%)
BuiltIn 186 (6.7%)
CarPort 15 (0.5%)
Detchd 782 (28%)
Unknown 157
Garage.Area 480 (320, 576)
Unknown 1
Wood.Deck.SF 0 (0, 168)
Open.Porch.SF 27 (0, 70)
Enclosed.Porch 0 (0, 0)
X3Ssn.Porch 0 (0, 0)
Screen.Porch 0 (0, 0)
Mo.Sold 6 (4, 8)
Yr.Sold
2006 625 (21%)
2007 694 (24%)
2008 622 (21%)
2009 648 (22%)
2010 341 (12%)
Sale.Condition
Abnorml 190 (6.5%)
AdjLand 12 (0.4%)
Alloca 24 (0.8%)
Family 46 (1.6%)
Normal 2,413 (82%)
Partial 245 (8.4%)
SalePrice 160,000 (129,500, 213,500)
1 Median (Q1, Q3); n (%)

1.3 Variable separation

The dataset contains both numeric and categorical variables. We need to separate them for further analysis. This is especially important for feature engineering.

num_vars <- colnames(ames)[sapply(ames, is.numeric)]
cat_vars <- colnames(ames)[sapply(ames, is.character)]
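As an aside, base R's Filter() achieves the same separation, since a data frame is a list of columns; a minimal sketch on a toy data frame (column names are illustrative):

```r
# Filter() keeps the columns for which the predicate returns TRUE
df <- data.frame(a = 1:3, b = c("x", "y", "z"), c = c(1.5, 2.5, 3.5))
num_demo <- names(Filter(is.numeric, df))    # "a" "c"
cat_demo <- names(Filter(is.character, df))  # "b"
```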

1.4 Side note on text and variable outputs

There are two ways to use values from R code in the text. The first is printing a pasted statement with print(); the second is embedding the value directly with inline R output.

1.4.1 Statement output

stmt <- paste("There are", length(num_vars), "numerical features and", length(cat_vars), "categorical features" )
print(stmt, na.print = NULL)
[1] "There are 29 numerical features and 12 categorical features"

1.4.2 Inline R output

There are 29 numerical features and 12 categorical features in this dataset.
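The sentence above is produced with inline R code rather than a chunk; a rough sketch of what the R Markdown/Quarto source looks like (using the variables defined earlier):

```markdown
There are `r length(num_vars)` numerical features and
`r length(cat_vars)` categorical features in this dataset.
```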

2 Resampling Methods

2.1 Test Set Approach

There are multiple ways to split your data in R (four are shown below); simple random sampling is one of them.

2.1.1 Using base R

set.seed(123)  # for reproducibility
index_1 <- sample(1:nrow(ames), round(nrow(ames) * 0.8))
train_1 <- ames[index_1, ]
test_1  <- ames[-index_1, ]

# kable(head(train), "html") %>% kable_styling("striped") %>% scroll_box(width = "100%")
print(dim(train_1))
[1] 2344   41
print(dim(test_1))
[1] 586  41
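As a sanity check, the sampled indices and their complement should partition the rows with no overlap; a sketch on synthetic indices (the sizes are illustrative):

```r
set.seed(123)
n <- 100
train_idx <- sample(1:n, round(n * 0.8))  # 80 training row indices
test_idx  <- setdiff(1:n, train_idx)      # the remaining 20 rows
length(intersect(train_idx, test_idx))    # 0 -- no leakage between the sets
```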

2.1.2 Using caret package

set.seed(123)  # for reproducibility
index_2 <- createDataPartition(ames$SalePrice, p = 0.7, 
                               list = FALSE)
train_2 <- ames[index_2, ]
test_2  <- ames[-index_2, ]

2.1.3 Using rsample package

set.seed(123)  # for reproducibility
split_1  <- initial_split(ames, prop = 0.7)
train_3  <- training(split_1)
test_3   <- testing(split_1)

2.1.4 Using h2o package

split_2 <- h2o.splitFrame(ames.h2o0, ratios = 0.7, 
                          seed = 123)
train_4 <- split_2[[1]]
test_4  <- split_2[[2]]

2.2 The Validation Set Approach

Draw a sample of size nrow(ames) from the numbers 1 to 3 (with replacement), so that approximately 70% of the sample is 1 and the remaining 30% is split equally between 2 and 3.

set.seed(123)
part <- sample(1:3, size = nrow(ames), prob = c(0.7, 0.15, 0.15), replace = TRUE)

We can also create training, validation, and testing sets from the original data frame directly.

train_5 <- ames[part == 1, ] # subset ames to training indices only
valid_5 <- ames[part == 2, ]
test_5  <- ames[part == 3, ]
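We can check that the realized proportions land near the 70/15/15 targets; a sketch on a synthetic draw of the same form (n is illustrative):

```r
set.seed(42)
n <- 10000
part_demo <- sample(1:3, size = n, prob = c(0.7, 0.15, 0.15), replace = TRUE)
round(prop.table(table(part_demo)), 2)  # roughly 0.70, 0.15, 0.15
```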

2.3 Stratified Random Sampling

The easiest way to perform stratified sampling on a response variable is to use the rsample package and supply the response variable to the strata argument. The following illustrates that in our employee attrition data we have an imbalanced response (No: 84%, Yes: 16%). By enforcing stratified sampling, both our training and testing sets have approximately equal response distributions.

Let’s import Job Attrition data from the rsample package.

# Job attrition data
library(rsample)
library(modeldata)
churn <- attrition %>% 
  mutate_if(is.ordered, .funs = factor, ordered = FALSE)
churn.h2o <- as.h2o(churn)

2.3.1 Original response distribution

t(table(churn$Attrition)) %>%
  prop.table() %>%
  round(4) %>%
  kbl(format = "html", table.attr = "style='width:30%;'") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = F,
    position = "center"
  )
No Yes
0.8388 0.1612

2.3.2 Stratified sampling with the rsample package

set.seed(123)
split_strat  <- initial_split(churn, prop = 0.7, 
                              strata = "Attrition")
train_strat  <- training(split_strat)
test_strat   <- testing(split_strat)

split <- initial_split(ames, prop = 0.7, 
                       strata = "SalePrice")
ames_train  <- training(split)
ames_test   <- testing(split)

2.3.3 Consistent response ratio between train & test

library(dplyr)
library(kableExtra)

train_strat$Attrition %>%
  table() %>%
  t() %>%
  prop.table() %>%
  round(4) %>%
  knitr::kable(format = "html", table.attr = "style='width:30%;'") %>%
  kableExtra::kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = F, 
    position = "center"
  )
No Yes
0.8395 0.1605

Now the test data frame:

t(table(test_strat$Attrition)) %>%
  prop.table() %>%
  round(4) %>%
  knitr::kable(format = "html", table.attr = "style='width:30%;'") %>%
  kable_styling(
    bootstrap_options = c("striped", "hover", "condensed", "responsive"),
    full_width = F,
    position = "center"
  )
No Yes
0.8371 0.1629

3 Feature Engineering

Feature engineering is the process of transforming raw data into informative features that better represent the underlying patterns in the data, aiming to improve the performance of machine learning models. By leveraging domain knowledge, statistical techniques, and creative transformations, practitioners craft new features, select relevant ones, or refine existing variables to make algorithms more effective and efficient. This process often plays a crucial role in enhancing model accuracy, generalization, and interpretability (Heaton 2016).

3.1 Loading packages

library(dplyr)    # for data manipulation
library(ggplot2)  # for awesome graphics
library(visdat)   # for additional visualizations

# Feature engineering packages
library(caret)    # for various ML tasks
library(recipes)  # for feature engineering tasks

3.2 Transformation

Let’s create a histogram of the SalePrice variable to understand the distribution of the variable.

hist(ames$SalePrice, col ='blue')

It seems the variable is skewed. We can apply a transformation to bring the data closer to satisfying the normality assumption. Let's also calculate the moments of the variable to quantify the skew. The base histogram also looks plain; can we make it look better?

library(ggplot2)
library(scales)

ggplot(ames, aes(x = SalePrice)) +
  geom_histogram(aes(y = after_stat(density)), fill = "#336699", color = "black", alpha = 0.65, bins = 25) + 
  labs(x = "Sale Price", y = "Density", title="Histogram of Sale Price") +
  scale_x_continuous(labels = label_comma(accuracy = 1)) +
  # scale_y_continuous(labels = label_comma(accuracy = 1)) +
  theme_light() +
  theme(
    plot.title = element_text(hjust = 0, face = "bold", size = 14),  # Left-align (hjust = 0) and bold the title
    axis.title.x = element_text(size = 10, color = "black", face="bold"),
    axis.title.y = element_text(size = 10, color = "black", face="bold"),
    axis.text.x = element_text(size = 8, color = "black"),
    axis.text.y = element_text(size = 8, color = "black")
  )

# install.packages("moments")
library(moments)
saleprice_skew <- skewness(ames$SalePrice)

Skewness is 1.7426074 (> 1), which indicates the data is strongly positively (right) skewed.

3.3 Variable Transformation

3.3.1 Log Transformation

Apply a log transformation (the natural log of the variable):

ln_sales <-log(ames$SalePrice)

hist(ln_sales, col='blue', breaks = 40, main = "Histogram of Log Sale Price", xlab = "Log Sale Price", ylab = "Frequency")

log_sales = skewness(ln_sales)

The distribution now appears approximately normal. The skewness value is -0.0147859; the transformation has made the distribution very slightly negatively skewed, but the skewness is now much closer to 0.

3.3.2 Square Root Transformation

sqt_sales <- sqrt(ames$SalePrice)
hist(sqt_sales, col='blue')

sq_skewness<- skewness(sqt_sales)

The distribution is not as close to normal as with the log transformation. The skewness value is 0.8843168, meaning the data is still moderately positively skewed; the square-root transformation performs worse than the log.

3.3.3 Cube-Root Transformation

cbt_sales <- (ames$SalePrice)^(1/3)
hist(cbt_sales, col='blue')

cbt_skewness <- skewness(cbt_sales)

The distribution is worse than the log transformation but better than the square root. The skewness value is 0.6069153, so the data remains positively skewed; of the three, the log transformation is preferable.
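This ordering (log beats cube root beats square root) holds generally for strongly right-skewed data; a small sketch on simulated lognormal data, with a manual sample-skewness helper standing in for moments::skewness so the snippet is self-contained:

```r
set.seed(1)
y <- rlnorm(5000, meanlog = 12, sdlog = 0.5)          # right-skewed, price-like
skew <- function(x) mean((x - mean(x))^3) / sd(x)^3   # sample skewness
c(raw  = skew(y),
  sqrt = skew(sqrt(y)),
  cbrt = skew(y^(1/3)),
  log  = skew(log(y)))   # skewness shrinks toward 0 moving down the list
```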

3.3.4 Box-Cox Transformation

#| message: false
# load package
library(sjPlot)

Attaching package: 'sjPlot'
The following object is masked from 'package:ggplot2':

    set_theme
library(sjmisc)
library(sjlabelled)

Attaching package: 'sjlabelled'
The following object is masked from 'package:ggplot2':

    as_label
The following object is masked from 'package:dplyr':

    as_label
m.ols <- lm(SalePrice ~ Gr.Liv.Area, data=ames)
ols_summ <- summary(m.ols)

tab_model(m.ols)
  Sale Price
Predictors Estimates CI p
(Intercept) 13289.63 6878.48 – 19700.78 <0.001
Gr Liv Area 111.69 107.64 – 115.75 <0.001
Observations 2930
R2 / R2 adjusted 0.500 / 0.499
library(MASS)

Attaching package: 'MASS'
The following object is masked from 'package:gtsummary':

    select
The following object is masked from 'package:dplyr':

    select
bc <- boxcox(m.ols)

str(bc)
List of 2
 $ x: num [1:100] -2 -1.96 -1.92 -1.88 -1.84 ...
 $ y: num [1:100] -14038 -13806 -13577 -13351 -13127 ...
(bc.power <- bc$x[which.max(bc$y)])
[1] 0.1818182

\begin{align} y_{transform} = \frac{y^{\lambda} - 1}{\lambda} \end{align}

We need to transform the response variable accordingly and re-fit the linear model with the transformed variable. We can write a function corresponding to the formula:

BCTransform <- function(y, lambda=0) {
  if (lambda == 0L) { log(y) }
  else { (y^lambda - 1) / lambda }
}

#Reverse Transformation
BCTransformInverse <- function(yt, lambda=0) {
  if (lambda == 0L) { exp(yt) }
  else { exp(log(1 + lambda * yt)/lambda) }
}

ames$SalePrice.bc <- BCTransform(ames$SalePrice, bc.power)
hist(ames$SalePrice.bc, breaks=30); rug(ames$SalePrice.bc)

SPbc_skewness <- skewness(ames$SalePrice.bc)
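A quick round-trip check confirms the two helpers invert each other for λ ≠ 0. The sketch below hard-codes the λ found by the boxcox grid search and uses the min, median, and max SalePrice values from the summary, so it stands alone:

```r
lambda <- 0.1818182                      # bc.power from the boxcox output
y  <- c(12789, 160000, 755000)           # min, median, max SalePrice
yt <- (y^lambda - 1) / lambda            # forward Box-Cox transform
y2 <- (1 + lambda * yt)^(1 / lambda)     # inverse transform recovers y
all.equal(y, y2)                         # TRUE
```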

summary(m.ols.bc <- lm(SalePrice.bc ~ Gr.Liv.Area, data=ames))

Call:
lm(formula = SalePrice.bc ~ Gr.Liv.Area, data = ames)

Residuals:
     Min       1Q   Median       3Q      Max 
-21.5363  -1.3894   0.1928   1.4254   8.5499 

Coefficients:
             Estimate Std. Error t value Pr(>|t|)    
(Intercept) 3.594e+01  1.497e-01  240.14   <2e-16 ***
Gr.Liv.Area 5.085e-03  9.456e-05   53.78   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 2.587 on 2928 degrees of freedom
Multiple R-squared:  0.4969,    Adjusted R-squared:  0.4967 
F-statistic:  2892 on 1 and 2928 DF,  p-value: < 2.2e-16

3.4 Feature Scaling

Let’s look at one of the continuous variables, Living Area. We want to scale the feature. In general, feature scaling refers to the process of standardizing the range of data features. Since the content of raw data values varies widely, in some machine learning algorithms, objective functions will not work properly without normalization.

We can perform Min-Max scaling or Z-score scaling. Min-Max Scaling transforms the data to a range between 0 and 1. Z-score scaling transforms the data into a mean of 0 and a standard deviation of 1.

The formula for Min-Max scaling is:

\begin{align} x_{scaled} = \frac{x - x_{min}}{x_{max} - x_{min}} \end{align}

The formula for Z-score scaling is:

\begin{align} z = \frac{x - \mu}{\sigma} \end{align}

where x is the original feature vector, \mu is the mean of that feature vector, and \sigma is its standard deviation.

G <-ames$Gr.Liv.Area
A <-ames$Lot.Area
max_A <- max(A)
min_A <- min(A)
max_G <- max(G)
min_G <- min(G)

The range of Gr.Liv.Area is 334 to 5642. The range of Lot.Area is 1300 to 215245.

3.4.1 Min-Max Scaling

Gr.Liv.Area_N <- (G - min_G) / (max_G - min_G)
Lot.Area_N <- (A - min_A) / (max_A - min_A)

minmaxscaler <-cbind.data.frame(ames, Gr.Liv.Area_N, Lot.Area_N)
head(minmaxscaler)
  Order Lot.Frontage Lot.Area Lot.Shape Utilities Lot.Config Land.Slope
1   117           80     9600       Reg    AllPub     Inside        Gtl
2   325           94     9400       Reg    AllPub     Corner        Gtl
3   337           70     7700       Reg    AllPub     Inside        Gtl
4   393           60     9000       Reg    AllPub        FR2        Gtl
5   590           NA    13774       IR1    AllPub     Inside        Gtl
6   667           81     9671       Reg    AllPub     Corner        Gtl
  Neighborhood Bldg.Type House.Style Overall.Qual Overall.Cond Year.Built
1       NWAmes      1Fam      2Story            6            6       1971
2      Mitchel    Duplex      2Story            6            5       1971
3      Mitchel    Duplex      2Story            5            2       1985
4        NAmes    Duplex      2Story            5            5       1974
5       NWAmes      1Fam      2Story            7            7       1977
6        NAmes    Duplex      2Story            6            5       1969
  Year.Remod.Add Foundation Bsmt.Unf.SF Total.Bsmt.SF BaseLivArea Central.Air
1           1971     CBlock         386           715         329           Y
2           1971     CBlock         912           912           0           Y
3           1986      PConc        1216          1216           0           Y
4           1974     CBlock         896           896           0           Y
5           1992      PConc         476           908         432           Y
6           1969     CBlock        1248          1248           0           Y
  X1st.Flr.SF X2nd.Flr.SF Gr.Liv.Area Full.Bath Half.Bath Bathrooms
1         930         715        1645         1         2         2
2         912         912        1824         2         2         3
3        1216        1216        2432         4         2         5
4         896         896        1792         2         2         3
5        1316         972        2288         1         2         2
6        1248        1296        2544         2         2         3
  Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Fireplaces Garage.Type
1             4             1           TA             7          0      Attchd
2             4             2           TA             8          0        <NA>
3             4             2           TA            10          0      Attchd
4             4             2           TA             8          0        <NA>
5             4             1           Gd             8          2      Attchd
6             6             2           TA            12          0      Attchd
  Garage.Area Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch
1         441            0            78              0           0
2           0          128             0              0           0
3         616          200             0              0           0
4           0           32            45              0           0
5         520          321            72              0           0
6         907            0             0              0           0
  Screen.Porch Mo.Sold Yr.Sold Sale.Condition SalePrice SalePrice.bc
1            0       6    2010         Normal    171000     43.68317
2            0       4    2010         Normal    139000     41.86486
3            0       2    2010         Normal    159000     43.03681
4            0       9    2009         Normal    136000     41.67733
5          156      11    2009         Normal    230000     46.40657
6            0       8    2009         Normal    190000     44.63443
  Gr.Liv.Area_N Lot.Area_N
1     0.2469857 0.03879502
2     0.2807084 0.03786020
3     0.3952524 0.02991423
4     0.2746797 0.03599056
5     0.3681236 0.05830470
6     0.4163527 0.03912688

3.4.2 Standardize (z scores)

z_Gr.Liv.Area <- (ames$Gr.Liv.Area - mean(ames$Gr.Liv.Area)) / sd(ames$Gr.Liv.Area)
z_Lot.Area <- (ames$Lot.Area - mean(ames$Lot.Area)) / sd(ames$Lot.Area)

zscore <-cbind.data.frame(ames, z_Gr.Liv.Area, z_Lot.Area)
head(zscore)
  Order Lot.Frontage Lot.Area Lot.Shape Utilities Lot.Config Land.Slope
1   117           80     9600       Reg    AllPub     Inside        Gtl
2   325           94     9400       Reg    AllPub     Corner        Gtl
3   337           70     7700       Reg    AllPub     Inside        Gtl
4   393           60     9000       Reg    AllPub        FR2        Gtl
5   590           NA    13774       IR1    AllPub     Inside        Gtl
6   667           81     9671       Reg    AllPub     Corner        Gtl
  Neighborhood Bldg.Type House.Style Overall.Qual Overall.Cond Year.Built
1       NWAmes      1Fam      2Story            6            6       1971
2      Mitchel    Duplex      2Story            6            5       1971
3      Mitchel    Duplex      2Story            5            2       1985
4        NAmes    Duplex      2Story            5            5       1974
5       NWAmes      1Fam      2Story            7            7       1977
6        NAmes    Duplex      2Story            6            5       1969
  Year.Remod.Add Foundation Bsmt.Unf.SF Total.Bsmt.SF BaseLivArea Central.Air
1           1971     CBlock         386           715         329           Y
2           1971     CBlock         912           912           0           Y
3           1986      PConc        1216          1216           0           Y
4           1974     CBlock         896           896           0           Y
5           1992      PConc         476           908         432           Y
6           1969     CBlock        1248          1248           0           Y
  X1st.Flr.SF X2nd.Flr.SF Gr.Liv.Area Full.Bath Half.Bath Bathrooms
1         930         715        1645         1         2         2
2         912         912        1824         2         2         3
3        1216        1216        2432         4         2         5
4         896         896        1792         2         2         3
5        1316         972        2288         1         2         2
6        1248        1296        2544         2         2         3
  Bedroom.AbvGr Kitchen.AbvGr Kitchen.Qual TotRms.AbvGrd Fireplaces Garage.Type
1             4             1           TA             7          0      Attchd
2             4             2           TA             8          0        <NA>
3             4             2           TA            10          0      Attchd
4             4             2           TA             8          0        <NA>
5             4             1           Gd             8          2      Attchd
6             6             2           TA            12          0      Attchd
  Garage.Area Wood.Deck.SF Open.Porch.SF Enclosed.Porch X3Ssn.Porch
1         441            0            78              0           0
2           0          128             0              0           0
3         616          200             0              0           0
4           0           32            45              0           0
5         520          321            72              0           0
6         907            0             0              0           0
  Screen.Porch Mo.Sold Yr.Sold Sale.Condition SalePrice SalePrice.bc
1            0       6    2010         Normal    171000     43.68317
2            0       4    2010         Normal    139000     41.86486
3            0       2    2010         Normal    159000     43.03681
4            0       9    2009         Normal    136000     41.67733
5          156      11    2009         Normal    230000     46.40657
6            0       8    2009         Normal    190000     44.63443
  z_Gr.Liv.Area  z_Lot.Area
1     0.2874520 -0.06953307
2     0.6415507 -0.09491373
3     1.8442990 -0.31064928
4     0.5782481 -0.14567503
5     1.5594376  0.46016117
6     2.0658580 -0.06052294

3.5 Feature Construction

3.5.1 Unsupervised Binning

Automatic binning determines the number of bins from the distribution of the data: bins are constructed so that data points within a bin are similar to one another, while data points in different bins are dissimilar.

summary(ames$Overall.Qual)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
  1.000   5.000   6.000   6.095   7.000  10.000 
count(ames, Overall.Qual)
   Overall.Qual   n
1             1   4
2             2  13
3             3  40
4             4 226
5             5 825
6             6 732
7             7 602
8             8 350
9             9 107
10           10  31
Overall.Qual_Bin <- cut(ames$Overall.Qual, breaks=4)
head(Overall.Qual_Bin)
[1] (5.5,7.75] (5.5,7.75] (3.25,5.5] (3.25,5.5] (5.5,7.75] (5.5,7.75]
Levels: (0.991,3.25] (3.25,5.5] (5.5,7.75] (7.75,10]
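The `cut()` call above produces equal-width bins. An alternative is equal-frequency binning, where the breaks come from quantiles so each bin holds roughly the same number of observations. A minimal sketch on a toy vector (illustrative data, not the Ames housing values):

```r
# Equal-frequency (quantile) binning: breaks are quantiles of the data,
# so each bin holds roughly the same number of observations, unlike the
# equal-width bins produced by cut() with a single `breaks` count.
x <- c(1, 2, 2, 3, 5, 6, 6, 7, 8, 10)                       # toy vector
q_breaks <- quantile(x, probs = seq(0, 1, length.out = 5))  # 4 bins
x_qbin <- cut(x, breaks = unique(q_breaks), include.lowest = TRUE)
table(x_qbin)
```

With heavily tied data (such as `Overall.Qual`) quantile breaks can coincide, which is why duplicates are dropped with `unique()` before they are passed to `cut()`.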

3.5.2 Manual binning

For manual binning we need to know the range of the variable. Here Overall.Qual runs from min=1 to max=10, so we choose the bins (0,5], (5,6], (6,7], and (7,10].

library(lattice)  # barchart() comes from the lattice package
Overall.Qual_MBin <- cut(ames$Overall.Qual, breaks = c(0, 5, 6, 7, 10))
barchart(Overall.Qual_MBin, main = "Overall Quality Manual Binning")

Sometimes, features will contain levels that have very few observations. For example, there are 28 unique neighborhoods represented in the Ames housing data, but several of them have fewer observations than the others.

count(train_1, Neighborhood) %>% 
  arrange(n) %>% 
  head() %>% 
  kbl() %>% 
  kable_styling(bootstrap_options = c("striped", "hover", "condensed", "responsive"),
                full_width = F, position = "center") 
Neighborhood      n
Landmrk           1
GrnHill           2
Blueste           8
Greens            8
NPkVill          15
Veenker          21

In examples like this, we may want to collapse all levels observed in fewer than 10% of the training sample into an “other” category, using step_other(). For example, Screen.Porch has 92% of its values recorded as zero (zero square footage, meaning no screen porch), and the remaining 8% are unique, dispersed values.

library(flextable)

Attaching package: 'flextable'
The following object is masked from 'package:gtsummary':

    continuous_summary
The following objects are masked from 'package:kableExtra':

    as_image, footnote
arrange(count(train_1, Screen.Porch),n) %>% flextable()

Screen.Porch values in train_1, grouped by frequency n:

  n = 1:     40, 53, 60, 63, 64, 80, 88, 90, 94, 95, 99, 104, 109, 111, 113,
             117, 119, 121, 122, 123, 126, 128, 130, 135, 141, 143, 145, 150,
             152, 154, 162, 163, 164, 165, 166, 178, 184, 185, 190, 197, 198,
             201, 217, 220, 221, 222, 231, 234, 252, 260, 263, 264, 270, 271,
             276, 280, 287, 291, 312, 342, 348, 374, 385, 396, 440, 490, 576
  n = 2:     92, 108, 116, 138, 140, 170, 175, 176, 182, 196, 204, 227, 240,
             255, 259, 266, 322
  n = 3:     100, 112, 115, 142, 147, 153, 156, 195, 210, 225, 256
  n = 4:     155, 189, 288
  n = 5:     160, 180, 224
  n = 6:     200
  n = 7:     216
  n = 8:     120, 144, 168
  n = 9:     192
  n = 2,137: 0

3.5.3 Standardize and Normalize

# Normalize all numeric columns (Yeo-Johnson power transformation)
recipe(SalePrice ~ ., data = train_1) %>%
    step_YeoJohnson(all_numeric())    
── Recipe ──────────────────────────────────────────────────────────────────────
── Inputs 
Number of variables by role
outcome:    1
predictor: 40
── Operations 
• Yeo-Johnson transformation on: all_numeric()
## Standardize: note that step_center()/step_scale() operate on a recipe,
## not directly on a data frame

# recipe(SalePrice ~ ., data = train_1) %>%
#     step_center(all_numeric(), -all_outcomes()) %>%
#     step_scale(all_numeric(), -all_outcomes())
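To see what centering and scaling actually do, here is a minimal base-R sketch on a toy numeric column (illustrative values standing in for a train_1 predictor). `scale()` performs the same subtract-mean, divide-by-standard-deviation operation that `step_center()` plus `step_scale()` apply inside a recipe:

```r
# Toy numeric column (illustrative values, not from the Ames data)
sqft <- c(800, 1500, 2400)

# Standardize: subtract the mean, divide by the standard deviation
sqft_std <- as.numeric(scale(sqft))

# The standardized column has mean 0 and standard deviation 1
round(c(mean = mean(sqft_std), sd = sd(sqft_std)), 6)
```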

3.5.4 Lump levels for two features

# lumping <- recipe(SalePrice ~ ., data = train_1) %>%
#   step_other(Neighborhood, threshold = 0.01, 
#              other = "other") %>%
#   step_other(Screen.Porch, threshold = 0.1, 
#              other = ">0")

lumping <- recipe(SalePrice ~ ., data = train_1) %>%
  step_mutate(Screen.Porch = factor(ifelse(Screen.Porch > 0, ">0", "0"))) %>%
  step_other(Neighborhood, threshold = 0.01, other = "other") %>%
  step_other(Screen.Porch, threshold = 0.1, other = ">0")

3.5.4.1 Modified Code

  1. recipe(SalePrice ~ ., data = train_1): initialization is unchanged, with SalePrice as the outcome.
  2. step_mutate(Screen.Porch = factor(ifelse(Screen.Porch > 0, ">0", "0"))): this new step converts the numeric Screen.Porch variable into a binary categorical variable. If the value of Screen.Porch is greater than 0, it is labeled “>0”, otherwise “0”.
  3. step_other(Neighborhood, threshold = 0.01, other = "other"): unchanged from the original code.
  4. step_other(Screen.Porch, threshold = 0.1, other = ">0"): same as the original code, but now applied to the newly created binary categorical Screen.Porch.

3.5.5 Apply this blueprint

apply_2_training <- prep(lumping, training = train_1) %>%
  bake(train_1)

3.5.6 New distribution of Neighborhood

count(apply_2_training, Neighborhood) %>% arrange(n) %>% flextable()

Neighborhood      n
BrDale           27
MeadowV          34
ClearCr          36
StoneBr          39
SWISU            41
NoRidge          54
Timber           58
IDOTRR           74
other            77
Crawfor          88
BrkSide          89
Mitchel          99
NWAmes          103
SawyerW         103
Sawyer          126
NridgHt         130
Gilbert         133
Edwards         141
Somerst         142
OldTown         183
CollgCr         213
NAmes           354

3.5.7 New distribution of Screen.Porch

count(apply_2_training, Screen.Porch) %>% arrange(n) %>% flextable()

Screen.Porch      n
>0              207
0             2,137

3.5.8 Variable Encoding

Categorical variables appear frequently in data analysis and machine learning (ML). Data that can be classified into categories or groups, such as colors or job titles, is called categorical data. Categorical variables must be encoded into numerical values before they can be used in statistical analysis or ML models.
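The two most common encodings, shown side by side on a toy factor (illustrative data), are label encoding (one integer code per level) and one-hot/dummy encoding (one indicator column per level). Base R's `model.matrix()` can produce the latter without any extra packages:

```r
# Toy categorical column (illustrative)
color <- factor(c("red", "blue", "green", "blue"))

# Label encoding: each level mapped to its integer code;
# levels are ordered alphabetically (blue = 1, green = 2, red = 3)
as.numeric(color)
# [1] 3 1 2 1

# One-hot / dummy encoding via model.matrix; "- 1" drops the intercept
# so every level gets its own indicator column
model.matrix(~ color - 1)
```

Label encoding imposes an artificial ordering on the levels, so one-hot encoding is generally safer for nominal variables.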

Warning: the dummies package is no longer available on CRAN; we use fastDummies instead.

3.5.8.1 Using fastDummies package

We can create dummy variables for all categorical variables in the dataset:

library(fastDummies)
ames_train_dummies <- dummy_cols(train_1, select_columns = NULL, remove_first_dummy = TRUE)
ames_test_dummies <- dummy_cols(test_1, select_columns = NULL, remove_first_dummy = TRUE)

3.5.8.2 One-hot encoding (caret)

# Create dummy variables transformation
dummies <- dummyVars(SalePrice ~ ., data = ames)

# Apply transformation to dataset
ames_dummies <- data.frame(predict(dummies, newdata = ames))

3.5.8.3 Label encoding (caret)

# Identify columns with class 'factor'
factor_columns <- sapply(ames, is.factor)

# Apply label encoding on those columns
ames[factor_columns] <- lapply(ames[factor_columns], function(x) as.numeric(as.factor(x)))

Here, label encoding was applied to every factor column at once. If some categorical columns are stored as character vectors, convert them to factors first, then apply the code above to encode all of their categories simultaneously.

3.6 Missing Values

3.6.1 Missing Data Visualization

Let’s look at all missingness in the Ames dataset.

library(visdat)

vis_miss(AmesHousing::ames_raw, cluster = TRUE)

Missing value imputation: for categorical features where fewer than 40% of the values are null / missing, the missing values are replaced with the mode, i.e. the value whose frequency is highest.
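As a quick check of that 40% threshold, the share of missing values per column can be computed with base R. A sketch on a toy data frame (illustrative; the same expression applies directly to the Ames data):

```r
# Toy data frame with some missing values (illustrative)
toy <- data.frame(a = c(1, NA, 3, NA), b = c("x", "y", NA, "z"))

# Proportion of NA values per column; columns under the 40% threshold
# qualify for mode (categorical) or mean (numeric) imputation
miss_prop <- colMeans(is.na(toy))
miss_prop
#    a    b 
# 0.50 0.25 
```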

Mode = function(x){
  ta = table(x)
  tam = max(ta)
  mod = names(ta)[ta==tam]
  return(mod)
}
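A quick usage example of the helper (the definition is repeated here so the snippet runs standalone). Note that Mode() returns all tied levels when the maximum frequency is shared, so the first element should be taken when imputing:

```r
# Repeated from above so this snippet is self-contained
Mode = function(x){
  ta = table(x)
  tam = max(ta)
  mod = names(ta)[ta==tam]
  return(mod)
}

Mode(c("a", "b", "b", "c"))      # "b"
Mode(c("a", "a", "b", "b"))[1]   # tie between "a" and "b"; take the first
```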

Identify the factor variables and replace any missing / null values with the mode:

for (col_name in names(ames)[sapply(ames, is.factor)]) {
  
  if (anyNA(ames[[col_name]])) {
    # Mode() may return several tied values; take the first
    ames[[col_name]][is.na(ames[[col_name]])] <- Mode(ames[[col_name]])[1]
    stmt <- paste('Null values of', col_name, 'feature have been replaced by mode value', 
                  Mode(ames[[col_name]])[1])
    print(stmt)
  }
}
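The loop above handles the factor columns; a symmetric sketch for the numeric columns replaces missing values with the column mean (toy data frame for illustration; the same loop applies to ames):

```r
# Toy data frame with missing numeric values (illustrative)
df <- data.frame(x = c(1, NA, 3), y = c(10, 20, NA))

# Impute each numeric column's NAs with that column's mean
for (col_name in names(df)[sapply(df, is.numeric)]) {
  if (anyNA(df[[col_name]])) {
    df[[col_name]][is.na(df[[col_name]])] <- mean(df[[col_name]], na.rm = TRUE)
  }
}
df
#   x  y
# 1 1 10
# 2 2 20
# 3 3 15
```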

\begin{align} \log(SalePrice) &= \beta_{0}+\beta_{1}X_{1}\\ SalePrice &= e^{\beta_{0}+\beta_{1}X_{1}}\\ SalePrice &= e^{\beta_{0}} e^{\beta_{1}X_{1}}\\ \end{align}

4 References

Heaton, Jeff. 2016. “An Empirical Analysis of Feature Engineering for Predictive Modeling.” In SoutheastCon 2016, 1–6. Norfolk, VA, USA: IEEE. https://doi.org/10.1109/SECON.2016.7506650.